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Abstract: This paper studies Bayesian variable selection in linear models 
with general spherically symmetric error distributions. We propose sub- 
harmonic priors which arise as a class of mixtures of Zellner's g-priors for 
which the Bayes factors are independent of the underlying error distribu- 
tion, as long as it is spherically symmetric. Because of this invariance to 
spherically symmetric error distribution, we refer to our method as a ro- 
bust Bayesian variable selection method. We demonstrate that our Bayes 
factors have model selection consistency and are coherent. We also develop 
Laplace approximations to Bayes factors for a number of recently studied 
mixtures of g-priors that have appeared in the literature (including our 
own) for Gaussian errors. These approximations, in each case, are given by 
the Gaussian Bayes factor based on BIC times a simple rational function 
of the prior's hyper-parameters and the R 2 's for the respective models. We 
also extend model selection consistency for several g-prior based Bayes fac- 
tor methods for Gaussian errors to the entire class of spherically symmetric 
error distributions. A simulation study and an analysis of two real data sets 
indicates good performance of our robust Bayes factors relative to BIC and 
to other mixture of g-prior based methods. 
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1. Introduction 

Suppose the linear regression model is used to relate Y to the p potential pre- 
dictors Xi, . . . , x p , 



where the subscript F refers to the full model A4p. In the model (1.1), a is an 
unknown intercept parameter, 1„ is annxl vector of ones, Xp — (x±, . . . , x p ) 
is an n x p design matrix, and f3p is a p x 1 vector of unknown regression 
coefficients. In the error term of (1.1), op is an unknown scalar and ep has a 
spherically symmetric (SS) distribution with the density /(||ej?|| 2 ), E[ep] = 0„ 
and Var[e^] = J„. We assume that the columns of Xp have been standardized 
so that for 1 < i < p, x\\ n — and x^Xi/n = 1. 
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We shall be particularly interested in the variable selection problem where 
we would like to select an unknown subset of the effective predictors. It will be 
convenient throughout to index each of these 2 P possible subset choices by the 
vector 

7 = (7i)-">7p)' 

where ji = or 1. We use q~ = 7'l p to denote the size of the 7th subset. The 
problem then becomes that of selecting a submodel of (1.1) 

y = al n + X 1 (3 1 + cr 7 e 7 . (1.2) 

In (1.2), Xj is the n x q 7 matrix whose columns correspond to the 7th subset 
of xi, . . . , x p , /3 7 is a q-y x 1 vector of unknown regression coefficients. Let M. 1 
denote the submodel given by (1.2). We assume the error term e 7 has the SS 
density 

e 7 ~/(|| e7 || 2 ) (1.3) 

for all 7, with -E[e 7 ] = 0„ and Var[e 7 ] = I n . Further er 7 is an unknown scalar in 
the error term. We note that, in most earlier studies, the error terms in linear 
models have been assumed to have a Gaussian distribution, e.g. as in George 
and Foster (2000) and Liang et al. (2008). 

In this paper, we assume that n > p + 1 (the so called classical setup) and 
{xi, . . . , x p } are linearly independent, which implies that 

rank Xp = p, rank X 7 = q~ r (1-4) 

We also assume in much of the paper that the null model Mn {q-y = or 
7 = (0, . . . , 0)') is not a possible model, that is, the number of possible models 
is 2 P — 1, rather than, 2 P , although in Section 3, we indicate an alternative 
development which allows all TP possible models. 

A Bayesian approach to this problem entails the specification of prior distri- 
butions on the models 7r 7 = Pr(A^ 7 ), and on the parameters a,/3 7 ,cr 7 of each 
model. For each such specification, of key interest is the posterior probability of 
M 1 given y, 



7T 7 771 7 (y) 7T 7 BF 7 

X; 7 7r 7 TO 7 (y) J] 7 7r 7 BF 7^' 



Pr(A<» = = "T-y , (1-5) 



where tin = Pr(A'JAr) = is assumed as mentioned in the above (although 
see Remark 3.3). In (1.5), m 1 (y) is the marginal density under A^ 7 and BF T j? 
is the Bayes factor for comparing each of /VJ 7 to the full model M.p which is 
defined as 



_ m 7 (y) 
m F (y) 



BF^.p — 



where mi?(y) is the marginal density under the full model. In Bayesian model 
selection, 

argmaxPr(jM 7 |y) = argmax7r 7 m 7 (y) = argmax7r 7 BF 7: ^ , (1-6) 

7 77 
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is typically selected as the best model. 

In this paper, the main focus is on BF T i?, not 7r 7 . Hence our main aim is to 
propose and study specifications for the prior distribution of the parameters for 
each submodel M. 1 . In particular, the joint density we consider for VW 7 has the 
form 

p 7 (a,0 7 ,a» oca--"- 1 ||0 7 ||-^+ !/ (1.7) 

for 7 = (6*i, . . . ,0 qy Y = (X 7 X 7 ) 1 / 2 /3 7 and a non-random hyper-parameter v 
with < v < q 7 . Since the term including <? 7 in the prior above, ||0 7 ||~ 9 "' +I ' for 
< v < min(2, gu,), is known as a sub-harmonic function, that is, 

Eti{d 2 /d6f}\\ej-^>o, 

we call the prior given by (1.7) a sub-harmonic prior. We will show that such 
priors lead to robust Bayes factors, in the sense that each Bayes factor does not 
depend on the form of the underlying error distribution. We also show that the 
resulting procedure has model selection consistency and is also coherent. 

The organization of this paper is as follows. In Section 2, we give details of 
the prior distribution. In Section 3, we show that the Bayes factor with respect 
to the above prior is given by 

BF 7:F (^) =BF° F (v) (1.8) 



where 



BpG ( , = j ' 9*-Hl+ 9) + {9{l -&,) + !}-*- dg 

' ' ' ./'V- l,: 1 • g) ■ {.'/ ,: i R%) H} <&' 



for < v < min 7 q 7 = 1. In (1.9), BF 7 . F (t/) is the Bayes factor for standard 
Gaussian errors and i? 7 and Rp are the coefficient of determination under the 
submodel A4 7 and the full model M. p, respectively. From (1.8), the Bayes factor 
does not depend on the same SS sampling density. Hence, even when there is no 
specific information about the form of the error distribution of each model (other 
than spherical symmetry), it is not necessary to specify the exact form of the 
sampling density. It suffices to assume it is Gaussian. As far as we know, in the 
area of Bayesian variable selection with shrinkage priors or Zellner's g-priors, the 
sampling density has been assumed to be Gaussian and this kind of robustness 
result has not yet been studied. Originally analogous robustness results were 
derived by Maruyama (2003), Maruyama and Strawderman (2005) and others 
in the problem of estimating regression coefficients with the Stein effect and 
it was this type of result that lead us to search for a similar phenomenon in 
the context of model selection. Note that we use the term "robustness" in this 
sense of distributional robustness over the class of SS error distributions. We 
specifically are not using the term to indicate a high breakdown point. The use of 
the term "robustness" in our sense is however common (if somewhat misleading) 
in the context of insensitivity to the error distribution in the context of shrinkage 
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literature. In Section 4, by use of the Laplace approximation, we approximate 

r — G 

the Bayes factor given by (1.8) as BF 7:F (^) s» BF 7:F (zy) 

BF 7:F (^) = BF^ F BIC 1.10 

ip(p - i/, 1 - i? F ) 



as n — >■ oo where 

p(s,r) =rs s - 1 {(l/r-l)e}- s 
and BF 7 . F [BIC] is the BIC based alternative for standard Gaussian errors 



BF^ F [BIC] = 



(1 - R^)- n n- q -y 1 1/2 
(1 - R 2 F )- n n-P j 



(See, e.g. Hastie, Tibshirani and Friedman (2009), Chapter 7.) Since (/?(s, r) does 
not depend on n, (1.10) is asymptotically equivalent to BIC with a simple 0(1) 
rational correction function depending upon v as well as the i?-squares and the 
numbers of predictors. Actually this is a special case of Theorem 4.1 in which 
several Bayes factors under Gaussian errors which have been proposed in earlier 
studies, are shown to have similar asymptotic approximations. While the main 
theme in this paper is to develop the relationship (1.8) under sub-harmonic 
priors, we believe that this asymptotic equivalence is another noteworthy con- 
tribution, in particular, from the computational point of view. In Section 5, we 
show that our Bayes factor has model selection consistency uniformly over the 
class of SS error distributions, as n oo and p is fixed. It also follows from 
these results that several model selection methods recently studied in the liter- 
ature for Gaussian errors have model selection consistency for the entire class 
of SS error distributions. We provide illustrations of the method and compar- 
isons with other methods using both simulated and real data in Section 6. We 
give concluding remarks in Section 7. The Appendix presents some of the more 
technical proofs. 



2. Prior distributions 



In this section, for each submodel, we give a prior joint density of a form 

p(a,/3 7 ,7? 7 ) =p(a)p(?7 7 )p(/3 7 |?7 7 ), 

where rj^ — 1/cr?. The natural choice of priors for location (a) and scale (rj 7 ) 
are 

Pai. a ) = 1 (-00,00) (a), (2- 1 ) 

and 

P^(%) = Vy^io^M- (2-2) 

In (2.1) and (2.2), the superscript 1 means that the prior density is improper. 
Since (2.1) and (2.2) have invariance to location and scale transformation, re- 
spectively, they are considered by many as non-informative objective priors. 



Y. Maruyama and W. Straw derman/B ayes factors 



5 



The problem is that they are improper and hence determined only up to an 
arbitrary multiplicative constant. In this paper, the use of improper priors is 
justified through sequences of proper priors approaching the target improper 
priors (2.1) and (2.2): 

Pa(a;h a ) = -^-I(-h a ,h a )i a ) ( 2 - 3 ) 

where h a — > oo and 

/ i \ 1 -it / x hi/1*,, h^iVy) ,„ 

Mw ' v) = Yju ^ " = ^bgA^ (2 ' 4) 

where h„ — > oo. See the beginning of Section 3 for details of the justification. 

Next we give a sequence of proper conditional priors on /3 7 given ?7 7 , which 
approach an improper conditional prior on /3 7 given 77 7 : 

poo 

Pl3\r ] {P 1 \{Vi^};h g )= I p g {g\v]h g )(t) q _ i {l3 1 \0 1 g^ 1 {X' 1 X 1 )- 1 )dg (2.5) 
JO 

where v and h g are non-random positive parameters. Further 

Pg (gW;h g ) = {vMh-^g^-H^ig), (2.6) 

and (j) q (-\n, S) denotes the g-variate Gaussian density with mean vector fi and 
covariance matrix X. The prior (2.5) clearly has a hierarchical structure and 
it can be interpreted a scale mixture of Zcllncr's g-priors. Similar priors have 
been considered by Liang et al. (2008) and Maruyama and George (2010) under 
the Gaussian linear regression setup. See Sub-Section 2.1 below for a review of 
priors on g. For any fixed h g > and v > 0, the prior p | gi r/ (y9 7 |{?7 7 , v}; h g ) is 
clearly a proper probability density, that is, 



W>'< 



When h g — > oo and < v < q 7 , the limit of a variant of p / 3|^(/3 7 |{?7 7 , v\\ h g ) is 
given by 

PftriiPili 1 !-/^}) 



\ h" /2 } 

, lim { ""To" tPl3\v(Py\{ T h> l '}\ h 9) 

h g — too Vj Z 



, |A A 7 | r? 7 / v \ 

9 (2^/2^/2 eX ? {- Yg^ X -< X ^ ) ^ 

r({g 7 - 4/2) 



2^/2^/2 



x;x 7 | 1 /2( /3 ;jf;x 7 /3 7 )-^-)/2 ?? -/2. 
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In summary, the proper prior joint density under M. 1 , which we will use in this 
paper, is given by 

p 7 (a,/3 7 ,?? 7 |z/;h.) =p a (a]h a )p n (r] 1 ]h v )p (3 \ ri (l3 1 \{ri 1 ,v}-,h g ) (2.8) 

with h — (h a , h n , h g Y , which clearly satisfies 

/OO /> />OG 
/ / p 7 (a, /3 7 ,rj^\v;h)dad/3ydr)j = 1, 
-oo Jn q -< Jo 

for any fixed h and any v > 0. In Section 3, we will use the improper joint 
density of a, (3 1 and ry 7 given by 



p 7 (a, /3 7 ,77 7 |z/) = piCa^C^^k^K^' ^}) 
= lim V(Ai, ^)p 7 (a, /3 7 , ?7 7 |^; ft.) 

_ r({g 7 -^}/ 2 ) |y/ y l 1 ^' x'X /3 H^-^/V'/ 2 - 1 



(2.9) 



where < v < q 1 and 

= {2h a }{2\ogh v }{(2/v)h»/ 2 } = (8/u)h a {logh n }h v / 2 . (2.10) 

Hence, in a certain sense, the larger v is (as long as <j 7 - o > 0), the more 
objective the joint prior p^(a,/3 7 ,?7 7 |^) is. 

In this presentation of the improper joint density pi{a, /3 7 , ry 7 |^), two facts, 

Kl. (a,/3 7 ) and ?7 7 are separable, 

K2. the part depending on r/^ is given by the power function r/^ 2 1 , 

will be the key for calculating the marginal density in the next section. See also 
Remark 3.1. 

If, in the above joint prior on (a,/3 7 ,?7 7 ), we make the change of variables, 
# 7 = (X^,X 7 ) 1 / 2 /3 7 , the joint prior of (a,0 7 ,?7 7 ) becomes 

tiWv'hW) = ^^ ll^ll-^^ 2 - 1 - (2-11) 

As noted in Section 1, the part depending on # 7 , H&yll - ^ -1 ^ for < v < 
min(2,<7 7 ), is known as a sub-harmonic function, that is, 

E SsPJ-^-^ = («7 - ^)(2 - «/)||e 7 ||-^-"5- a > o. 
»=i * 

2. 1 . Review of priors on g 



As noted above, the prior given by (2.5) is a scale mixture of Zellner's g-priors. 
Actually the original Zellner's g-priors were used for the Gaussian linear re- 
gression setup and historically the hyperparameter g has been a priori fixed or 
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somehow estimated. The first paper to effectively use a prior on g was Zcllner 
and Siow (1980); they stated things in terms of multivariate Cauchy densities, 
which can always be expressed as a mixture of g-priors. Here we review the prior 
on g, the second stage of <?-priors for Gaussian linear regression. We hope that 
it makes the positioning of our prior on g, (2.6), which is for SS errors, clearer. 
As a generalization of p g {g\v; h g ) given by (2.6), consider the prior on g, 

Pg (9\{»,k};h g ) = Vg{{v , k} . hg) • ( 2 - 12 ) 

(so p g {g\{v,0};h g ) = p g (g\v;h g )). In (2.12), k > 0, v > -k, h g > and 
V g {{v, k}; h g ) = f g v ' 2 -\l + g-T k/2 dg. 



Note that it is improper at g = when v < —k. As we will see in Section 3 and 
4, Bayes factors are not well-defined when the prior on g is improper at g = 
and that is why v > — k is assumed. 

When k > and —k < v < 0, without truncating by h g the prior of g is 
proper under the support g £ (0, oo). In this case, the normalizing constant 
becomes 

V g ({is, k}; oo) = B(-u/2, {v + fc}/2). 

Further even when < v < q 1 (non-negative v implies impropriety at g = oo), 
the Bayes factor under M. 1 is well-defined as shown in Section 3. 

As in (2.5) and (2.7), we are considering the (improper) case where < v < q 1 
and k = 0. Liang et al. (2008) considered the (proper) case where —2 < v < 
and k = 2 — v. Guo and Spcckman (2009) and Celeux et al. (2011) considered 
the (improper) case where v = and k = 2. 

Remark 2.1. The noteworthy difference between k > and k = is that the 
(proper or improper) prior on /3 7 given rjy, 

lim V g ({v,k};h g ) [ 9 pMi^k^h^i^grj-^X'X^dg 

h g -yoa J 

= Jo (l + 9-^ 2 (2ng)^ 6XP {~ 2g^ X ^ ) * 

(2.13) 

is not separable for any k > 0. As will be seen in Section 3, the separability 
of the prior is the key for calculating the marginal density for SS error models 
other than Gaussian. 

Remark 2.2. Here we discuss objectivity (or at least non-subjectivity) of the 
prior in terms of hyper-parameters of the prior on g. Consider the (proper or 
improper) prior on /3 7 given ?y 7 , which is given by (2.13). In order to obtain 

1 /2 

the asymptotic behavior of the density as rjj ||/3 7 || — > oo, we appeal to the 



Y. Maruyama and W. Straw derman/B ayes factors 



8 



Tauberian theorem for the Laplace transform (see Geluk and de Haan (1987)). 
Since (1 + g~ x ) k / 2 — >• 1 for any k as g — > oo, 

hm P^\^M) = F({ g - „}/2) 1/2 

when v < q~. Hence the asymptotic order of (2.14) is the same as (2.7). The 
larger v{< q 7 ) is, the more objective the prior Pp\ v (f3'y\'>]'yi { v i k}) is- 

3. Marginal density and Bayes factor under sub-harmonic priors 

In this section we derive the marginal density under each submodel and the 
Bayes factor for comparing each M. 1 to the full model A4p. The marginal density 
of y under A4 7 , is given by 

/oo p /*oo 
/ / < /2 /(^ll2/-«ln-^ 7 /37l| 2 ) 
-oo Jr"-< JO (3.1) 

x p 7 (a, /3 7 , rj-ylu; h)da df3 7 dr]^, 

where the proper joint prior p 7 (a, /3 7 , ?y 7 |^; h) is given by (2.8). In (3.1), h does 
not depend on the submodel, but is the same in all models. In the following, 
instead of m 1 {y\v ) h) directly, we consider the limit of the product of m 7 (y|^; h) 
and V(h, u), 

Mj(y\v) = lim V(h, v)m 1 (y\u] h) 

/OO /• /'OO 
/ / ^ /2 f(v4y-^i n -x^\\ 2 ) (3.2) 
-oo Jri-i Jo 

x p 7 (a, /3 7 , rjyl^da d/3-y dr)-y, 

which is the marginal density of y with respect to the improper prior (a , /3 7 , ry 7 1 v) 
given by (2.9): 

i, a I \ r({g 7 -,}/2) IX'X,^ 2 - 1 
P >,/3 7 ,,» = 2 , /27r?T/2 ( ^ x , X7/37)(9t -,)/ 2 - ( 3 - 3 ) 

The second equality in (3.2) follows from the monotone convergence theorem. 
Since V(h, v) does not depend on A4 7 if we choose the same v in all sub- 
models, the Bayes factor m 1 (y\iy; h)/rriF{y\v, h) approaches M 1 {y\v)/Mp{y\v) 
as h — > oo. Hence the use of the improper joint prior is justified as long as 
M^(y\v) / M p{y\v) is well-defined. As remarked in Section 1, the null-model is 
not a possible model. Since there is no (3 and hence no h g , Mm (y\v) / Mp{y\v) is 
not well-defined. And thus ttn — Pr(A^Ar) ~ is assumed. However, see Remark 
3.3 for an alternative development which allows Mn as a possible model. 

Let M° (y\v) be the marginal density under M.^ with standard Gaussian 
errors €q. Before proceeding with the entire calculation of the marginal density, 
My(y\i>), when e 7 has a general SS distribution, we will provide a relationship 
between M 1 {y\v) and M^{y\v) as follows. 
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Lemma 3.1. Let v be between and q 7 . Assume the existence of E[\\e^\\ v ]. 
Then 

M ^ v) = fjfelFI< (2/i!/) - (3 ' 4) 

Proof. See Appendix. □ 

Hence M~{y\v) depends on the error distribution e 7 only through the v-th 
moment of e 7 , i5[||e 7 || ,y ]. 

Remark 3.1. This relationship of Lemma 3.1 remains true under more general 
separable priors rj~/ 2 1 ir(a, (3~ ( ), which satisfies Kl and K2. Unfortunately we 
could not find any other priors, which lead to analytically tractable Bayes fac- 
tors under Gaussian errors, except for our sub-harmonic priors ply (a, /3 7 , rj^\u) 
given by (2.9). It follows that for any such separable prior, the distributional 
robustness of the Bayesian variable selection procedure we develop for our sub- 
harmonic prior will carry over. That is, whether MCMC or another computa- 
tional method is used, it suffices to assume that the error distribution is Gaussian 
(with the same error variance) for all submodels, but the procedure will be si- 
multaneously valid for all SS error distributions. Our prior has the advantage 
of analytic tractability, and as we show below, good performance. 

We will make use of the following result which may be founded in Liang et al. 
(2008). 

Lemma 3.2. Let < v < q 1 . Then 

G A _ n 1/2 r({n- l}/2) f°° fl "/2-i(i +ff )(n-?-r-i)/2 



M ?(.V\ V ) = -n --. ||„-i T _l)/2 / , ,(n-l)/2 d 9> ( 3 - 5 ) 

||y-yl„||" "" ( >' Jo { g (i - R?) + i} ( ]l 

where i? 7 is the coefficient of determination under the submodel M. 1 . 

See equation (5) of Liang et al. (2008). Combining Lemmas 3.1 and 3.2, we 
have the main result of this paper. 

Theorem 3.1. Assume the full model Mp and the submodel M.~ i are given by 
(1.1) and (1.2), respectively. Also assume their error terms, ep and e 7 have the 
same SS distribution (1.3) with mean zero and the identity covariance matrix. 
Let < v < q~y. Assume that the proper joint prior densities of (a, fip, Vf) 
and (a,/3 7 ,77 7 ) are given by (2.8) and assume also E[\\e~ f \\ 1 '] < oo. Then, for 
.M 7 ^ M. n , the limit of the Bayes factor for comparing each of M. 1 to the full 
model Mp is given by 



BF 7:F H = Hm m ^ h J = BF G {v) (36) 
h.->oo mp[y\v; h) 



where 



Jo 9 2 i 1 + 9) 2 {g(^-Rp) + ^} 2 dg 
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Remark 3.2. Maruyama and George (2010) considered Bayesian variable selec- 
tion under Gaussian errors. They proposed Bayes factors with a simple analytic 
form under generalized ridge-type priors. The results heavily depend on spe- 
cial features of Gaussian distributions and hence the extension or generalization 
of Maruyama and George (2010) to the general SS case, may not be possible, 
or may not lead to analytically tractable procedures which are distributionally 
robust to SS error distributions. 

Remark 3.3. Expression (3.7) again shows why the null model is not allowed 
as a possibility. For the null model A4n, R% = so the numerator of (3.7) 
is infinite, and hence so would be BF^. F (i^). This situation may be avoided 
at a slight cost in complexity and in interpretability of the expressions. The 
required alteration in the prior distributions (proper and improper) is to treat 
the intercept parameter a as another j3, (and not give it a "uniform" prior). 
This results in replacing the improper prior in (2.9) by 

p 7 (/3 7 ,?7» = \ iJ ^X^^'^X^)-"^^- 1 , 

2 2 7T 2 

where /3 7 = (a,/3 7 )' and X 7 = (1„|X 7 ). Similarly the marginal distribution in 
(3.5) and the Bayes factor given by (3.7) are replaced by 

,VC,..,.a _ r(»/2) [°° ^/2-l(l + ff )(n- 9T -l)/2 



and 



M 7 ^ W = II in r/2 / ~, ~^ W2~ d ^ 



BF G M = Jr9*-Hl + 9)-+-W-fy + l}-*dB g) 
T 0^(1 +2)^3(1 -^) + ir f V '"" 

where R 2 = 1 — RSS 7 /||y|| 2 , (the "coefficient of determination" of the model 
Ai 7 relative to the 0- intercept model). Hence with the substitution i? 7 — > R? ( , 
n — 1 n, q 1 —} q 1 + 1, y — yl n —> y, all expressions and results in the paper 
remains valid and the result (Corollary 5.1) on model selection consistency in 
Section 5 holds also for the null model. Clearly R 2 is somewhat unusual, but 
if model selection consistency under the null-model is desirable, we can use 

w G 

BF f (f). Actually, under the Gaussian regression setup, Guo and Speckman 
(2009) and Celeux et al. (2011) recommend use of the Bayes factor as a function 

of .R 7 , which just substitutes g u ^ 2 ~ 1 with 1/(1 + g) in BF 1 . F (v) given by (3.8). 

Remark 3.4. A collection of Bayes factors is called coherent if 

BF 7i:72 BF 7i:7q BF 7q:72 , and BF 7i:72 — l/BF 7a:7l , 

for all 71 and 72 (see, e.g. Robert (2007)). By (3.6), the Bayes factors corre- 
sponding to our sub-harmonic priors are coherent (with the exception of those 
involving the null model Mn)i which is why we require irpj = 0. Also with the 
adaptation of the alternative specification given in Remark 3.3, coherence holds 
for all Bayes factors including those involving the null model Mn- 
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As in Zellner and Siow (1980) and Liang et al. (2008), the posterior proba- 
bility of any model is an expression of the form (1.5) given by 

Pr(M» = v 7 7 ■ 
2^ 7 7r 7 -Dr 7 :F 

In our development, we choose the full model M.p as the base model rather than 
the null model Mn employing the encompassing approach of Zellner and Siow 
(1980). Without employing the adaptation of our prior described in Remark 
3.3, the choice of the full model M.p as the base model, as opposed to the null 
model M.n, is forced on us since itn = 0. Liang et al. (2008) argue that the null 
model is the superior choice as a base model in their setup (which also involves 
g-priors or mixture thereof) due to incoherence of the Bayes factors if Mp is 
chosen as the base model (in their setup). This incoherence arises in the setup 
of Liang et al. (2008) because prior distribution on the full model M. p depends 
on each nested alternative M y . This incoherence is not a problem in our setup 
since the choice of prior for each submodel depends only on the submodel, and 
we have taken care that all relevant (conditional) posteriors are well defined. As 
noted above, by (3.6) our Bayes factors are coherent. In fact our development 
(aside from eliminating the null model from the consideration) is very close in 
spirit to the null-based Bayes factors approach in Liang et al. (2008). 

Remark 3.5. By Theorem 3.1, even when there is no specific information about 
the error distribution of each model (other than spherical symmetry), but we 
assume they are all the same, it is not necessary to specify the exact form of the 
sampling density. It suffices to assume they are all Gaussian. As far as we know, 
in the area of Bayesian variable selection with shrinkage priors, the sampling 
density has been assumed to be Gaussian and this kind of robustness result has 
not yet been studied. Analogous robustness results have been derived by Cellier, 
Fourdrinier and Robert (1989), Maruyama (2003), Maruyama and Strawderman 
(2005) and others in the problem of estimating regression coefficients with the 
Stein effect. 



3.1. BIC under spherically symmetric error distributions 

BIC (Schwarz (1978)) is a popular criterion for model selection. See e.g. Hastie, 
Tibshirani and Friedman (2009) Chapter 7. We will show in this subsection that 
BIC has a similar distributional robustness property to the above Bayes model 
selection procedure. In Section 4, we will develop Laplace approximations to our 
Bayes factors which relate them to BIC. In Section 5 we will show that both 
the BIC and our Bayes model selection procedures are consistent for the entire 
class of SS models. 

BIC for the model A4 7 is defined as 

[BIC] 7 = -21n( max V ^ 2 f ( Vl \\y - al n - X 7 /3 7 || 2 ) n^*/ 2 ) , (3.9) 
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and is derived by eliminating 0(1) terms from the approximate marginal den- 
sities. Here we denote 

M 7 (y|BIC) =exp(-[BIC] 7 /2). (3.10) 

In general, the maximization with respect to unknown parameters in (3.9) is 
not always tractable. However when e 7 has a unimodal SS distribution, the 
maximization is achieved by a = y, /3 7 = (X' X^) -1 X' y, and 

1/t) 7 =c\\y- al n - X 7 /3 7 || 2 = c\\y - yl„|| 2 (l - R 2 ) (3.11) 
where c is the sole solution of 

n/2 + cf'(c)/f(c) = 0. (3.12) 
Hence M 7 (y|BIC) may be expressed as 

-n/2 ff \ 

M 7 (y|BIC) = 1 U M G (y]BIC) (3.13) 

c G Jg[cg) 

where M G (y|BIC) is M 7 (y|BIC) with Gaussian errors, specifically 



M G (y|BIC) = c G n/ 'f G (c G ){\\y - yl„|| 2 (l - R 2 )y n l 2 n^' 2 
= n- n / 2 f G (n){\\y - yl n \\ 2 (l - i? 2 )}-"/ 2 n^/ 2 



(3.14) 



(since c G is given by n). Clearly (3.13) and (3.14) correspond to (3.4) and (3.5), 
respectively. Hence we have the following result. 

Theorem 3.2. Assume the full model Aip and the submodel A^ 7 are given by 
(1.1) and (1.2), respectively. Also assume their error terms, ep and e 7 have a 
unimodal SS distribution (1.3) with the mean zero and the identity covariance 
matrix. Then the Bayes factor based on BIC for comparing each of My to the 
full model Mp is given by 



where BF F [BIC] is the BIC based Bayes factor under Gaussian 



errors, 



1/2 



Obviously (3.16) corresponds to (3.7). By Theorem 3.2, the Bayes factor 
based on BIC is also independent of the error distribution provided each distri- 
bution is unimodal and is the same for all models (c.f. Theorem 3.1). Note that 
BF G :F [BIC] is well defined if 7W 7 = M N . 
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Remark 3.6. In an earlier version of this paper we developed the results in 
the more general context wherein the SS distribution of e 7 ~ /7(||e 7 || 2 ) could 
depend on 7, i.e. it could be different for each submodel M. 1 . So ~ /f(||c_f|| 2 ) 
as well. All of the above results can be developed for the more general case. The 
only essential changes are that (3.6) in Theorem 3.1 becomes 

BF 7!j ,(i/) = IjJI^ BF^H (3.17) 

and that (3.15) in Theorem 3.2 becomes 

BF 7:F [BIC] = g^frl BF g f[B ic] (3.I8) 
c F Jf{cf) 

where c 7 and cp are the corresponding solution of (3.12) respectively. We inves- 
tigated the ranges of these "correction terms" in (3.17) and (3.18) respectively 
when e 7 and ep have possibly different SS ^distributions with at least 3-degrees 
of freedom (so that the variances exist), both analytically and numerically. We 
found that BF 7: ^(i/) was independent of n and reasonably stable for all v in 
range (0, 1) but that stability was greater for v close to 0. This trade-off between 
stability (favoring v 0) and objectivity (favoring larger v) led us initially to 
prefer the midpoint of the allowable values in (0,1), namely v = 1/2 as the 
default choice. However the examples presented in Section 6 indicate that the 
performance of the method seems insensitive to the choice of v in the range of 
(0, 1). The correction factor for BF 7: ^[BIC] on the other hand depends on n and 
is considerably less stable to changes in the distributions of e 7 than BF T ir(i/). 

It is interesting to note in connection with the above that the correction term 
^'[|l e 7l| l/ ]/-^'[|l e i ;, l| !/ ] f° r BF 7: i?(^) approaches 1 as v — > 0. Hence choices of v close 
to are essentially completely robust to choice of SS error distribution for the 
submodels. 

Note also that if we force 7r 7 = for all submodels such that q 1 < 2, then the 
allowable range of v is (0, 3) and hence v = 2 becomes a possible choice. In this 
case, again, the correction term i?[||e 7 || ly ]/i?[||e^|| ly ] = 1 regardless of the choice 
of error distributions, since we have assumed the variance of each component is 
1. Hence, again, in this case, the Bayes factor is completely robust to choice of 
SS error distribution. Additionally, the case v — 2 corresponds to the harmonic 
prior 

p 7 (a,/3 7 ,r/»cx ||0 7 || 2 -^' 

where 7 = (X 7 X 7 ) 1 / 2 /3 7 and q 1 > 3. It is well-known that the harmonic 
prior plays important roles in estimation problems with the Stein effect. See 
Maruyama (2003) for the detail. It is interesting to observe the additional ad- 
vantage of the harmonic prior in the model choice problem. See Section 7 for 
some additional discussion of such priors. 



Y. Maruyama and W. Straw derman/B ayes factors 



14 



4. The Laplace approximation of BF under Gaussian errors 

In Section 3, we saw that the Bayes factor BF 7: jr(i / ) under SS errors is equal 
to BF^. F (z/), which is the Bayes factor under Gaussian errors. In this section, 
we consider the so-called Laplace approximation of some Bayes factors under 
Gaussian errors. We will approximate not only the function ~BF^. F (v) but also 
Bayes factors with respect to more general priors where the prior on g is (2.12); 

Pg (g\{v, k}- ^) ex g^-\l + g- 1 )-^ 2 . (4.1) 

When the same prior on g is used for A^ 7 and Mf, improper choices of v 
(0 < v < q 7 ) as well as proper choices of v (— k < v < 0) are valid for use. 
Under Gaussian errors, the Bayes factor for comparing each of A4 7 to M.p is 
well-defined as 

BY% F [v, k] 

= £ g^-Hi + g-r^q + gf^zjgQ- - g) + iTg d 9 (4.2) 

/ °° g*/2-l(l + g-l)-*/2(l + g)^{g(l - R p ) + 1}"^ d g 

where k > 0, ~k < v < q 1 . 

First we provide a summary of Laplace approximations to the integral based 
on Tierney and Kadane (1986). For integrals of the form 

exp(/i(r, n))dr, 

we make the use of the fully exponential Laplace approximation, based on ex- 
panding a smooth unimodal function h(r, n) in a Taylor series expansion about 
f, the mode of /i(r, n). The Laplace approximation is given by 

f-oo exp(/i(r, n))d,T 

lira , „ -— — -T- = 1 (4.3) 

n^oo (2tt) l > z <ji 1 exp(n(r, n)) 

where 

d 2 h{T,n) ^ 1/2 



dr 2 

In the following, we will use the symbol f(n) « g(n) (n — > 00) if 

lim = 1. (4.4) 
n->oo g(n) 

Hence the approximation given by (4.3) is written as 

exp(/i(T, n))dr « (27r) 1 / 2 <T/ 1 exp(/i(f,n)), (n ->• 00). (4.5) 

The next result gives approximations of the Bayes factor (3.7) in terms of 
the Bayes factor based on BIC given in (3.16). 
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Theorem 4.1. Let the prior be given by (4.1). Assume that {v, k} does not 
depend on n. 

1. Assume —k < v < q, and < r < 1. Then 

^- (l^W 4 ^^}" 3 , (4-6) 



o (1 + g-i)*/ 2 (l + rg)^ I n«-» 

where (fi(s, r) = rs s ~ {(1/r — l)e}~ s . 
2. Assume that —k < v < q 7 and that {i/, k} does not depend on M.^. Assume 



G 



also < R'i < Rp < 1. Then BF G F [u, k] w BF F {v) wh 



•_G / s I ip{q 1 -v,l- R*) 
(p(p -v,l- R 2 F ) 




ere 



BF 7:f(^) = { J! J; } BF^ :F [BIC] (4.7) 



and BF^. F [BIC] is the BIC based alternative under Gaussian errors. 

Proof. See Appendix. □ 

Clearly the function tp does not depend on n and hence Theorem 4.1 shows 
that BF^. F [i>, k] is asymptotically equivalent to BIC with a simple 0(1) correc- 
tion function depending v as well as {p, q 7 } and the i?-squares. Although several 
fully Bayes factors for the variable selection problem have been proposed in the 
literature, the relationship between the approximate Bayes factors and naive 
BIC has not been shown to the authors' knowledge. In this sense, while the 
main contributions in this paper are given in Section 3, Theorem 4.1 may be 
a practically useful contribution because of the simplicity of the approximate 
Bays factor. 

In this section, we have considered general Bayes factors under Gaussian 
errors. Remember that BF^ : p(v) the Bayes factor w.r.t. sub-harmonic priors, 
under SS errors, is equal to BF^. F [v, 0] for < v < 1. Under Gaussian errors, 
Liang et al. (2008) recommended the use of BF^. F [z;, 2 — v] with —2 < v < 
0. Guo and Speckman (2009) and Celeux et al. (2011) recommended the use 
of BF^. F [0,2]. By Theorem 4.1, these Bayes factors may be approximated as 
follows. 

Corollary 4.1. 

G ~ G 

BF T . F {v) — BF tF \v, 0] ~ BF F {v) for < v < 1. sub-harmonic prior 



G ~ G 

BF rF [0, 2] rj BF tF (0). a version of Guo and Speckman (2009) 
BF^ F [^, 2 - v] w BF F (v) for - 2 < v < 0. Liang et al. (2008) 



. Q 

In Section 6, we will see how approximate Bayes factors BF 1 . F {y) work nu- 
merically and how sensitive they are to to the choice of v. 
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5. Model selection consistency 

In this section, we consider model selection consistency in the case where p is 
fixed and as n approaches infinity. Let A4t be the true model, 

y = a T l n + X T (3 T + cr T e. 

Consistency for model choice is defined as 

plimPr(M T |</) = 1, 

n— >oo 

where plim denotes convergence in probability and the probability distribution is 
the sampling distribution under the true model M.t- We will show that Bayes 
factors considered in the previous sections has a model selection consistency 
under generally SS errors. The consistency property is clearly equivalent to 

BF 

plim BF rT = plim rF =0 V 7 + T. (5.1) 

n->oo n-s-oo Br T:F 

For model selection consistency, we make the following assumptions; 

Al. U n = ||e|| 2 /n is bounded in probability from below and from above, that 
is, for any c > and any positive integer n, there exists an M such that 

Pr (M _1 <U n < M) > 1 - c. 



A2. The limit of the correlation matrix of xi, . . . , x p , linin^oo X' F Xp/n, exists 
and is positive definite. 

Al seems more general than necessary. It appears that, by the law of large 
numbers, U n ought to converge to 1 in probability, but this is not necessarily 
true if the error distribution is not Gaussian. In the case of a scale mixture of 
Gaussians, U n approaches, in law, a random variable £ which has the distribu- 
tion of the mixing variable of the variance. Even when the error distribution is 
not a scale mixture of Gaussians, Al appears to be a reasonable and minimal 
assumption. A2 is the standard assumption which also appears in Knight and 
Fu (2000) and Zou (2006). Under these mild assumptions, we have following 
preliminary results for proving the consistency. 

Lemma 5.1. Assume Al and A2. 

1. Assume M.^ ^ Mn- For any < k < 1 and any positive integer n, there 
exists a 0^(7, k) > 2 such that 

p i^ ) < a ' <1 -^m) >1 - L (5 - 2) 

2. Let •y^T. Then (1 - R 2 T )/{1 - R? ( ) > 1. Further for any < k < 1 and 
any positive integer n, there exists a €2(7, T, k) > such that 

l-Rp n 
•2 

'7 • 



Pn^lT^ig) < l + ca(7,T,fc) ) >l-k. (5.3) 
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3. Let 7 ^ T. Then for any < k < 1 and any positive integer n, there exists 
a €3(7, T, k) > 1 such that 

Proof. See Appendix. □ 

First we give a consistency result on BIC. 

Theorem 5.1. Assume Al and A2. The Bayes factor based on BIC under 
Gaussian errors 



BF^BIC] = 



(l - ij2)- 7l rr*> 1 1/2 

(1 - R 2 F )- n n-P J 



is consistent for model selection under SS errors (including A4 7 = A4n )■ 
Proof, we show that 

BF^BIC] _„^L 9T _^1-^ X ^ 1/2 

71^- OO 



Consider the following two situations: 

1. 7 D T: By part 2 of Lemma 5.1, {(1 - R 2 T )/{1 - R 2 ,)} 71 is bounded in 
probability. Since q 7 > qr, (5.5) is satisfied. 

2. 7 ^ T: By part 3 of Lemma 5.1, (1 - R 2 T )/{1 - i^) is strictly less 
than 1 in probability. Hence {(1 — i?y)/(l — R^)} n converges to zero in 
probability exponentially fast with respect to n. Therefore, no matter what 
value qT — q~/ takes, (5.5) is satisfied. 

These complete the proof. □ 

Note that in Theorem 5.1 we do not exclude the null model A4n and hence 
BIC has model selection consistency even when the null model is true. When 
we consider consistency of the Bayes factors treated in the previous sections, 



G 

? 

7: 

they all still have model selection consistency among non-null models 



BF 7 j(i/), BF^ :F [^, k], BF 7 . F (^), we have to exclude the null model Mn, but 



Corollary 5.1. Assume Al and A2. {v, k} is assumed independent of n and 
Mj. Assume also Mn is excluded from possible models. Then 



~G 

1. BF . F (i>) for v < 1 is consistent for model selection under SS errors. 

7 

errors 



2. 'SF 1 . F [v, k] for —k < v < 1 is consistent for model selection under SS 
errors. 

3. BF 7:F (j/) for < v < 1 is consistent for model selection under SS errors. 
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Proof. By part 1 of Lemma 5.1, when M. 1 ^ M-N-, both R and R F are positive 
and strictly less than 1 with probability 1. Hence both <p(q-y — v 1 1 — i? 7 ) and 
(p(p — v, 1 — Rp) where 

tp{s,r) =rs s ~ 1 {(l/r-l)e}- s 

are positive and bounded from above with probability 1 provided v < 1 and v is 
independent of n and A4 7 . (On the other hand, since R 2 N = 0, <p(q<y — v,l — R 2 N ) 
is not defined.) As in Theorem 4.1, 



1/2 



<p(p - M 




, £ 

Hence consistency of BF F (zv) follows from consistency of BIC. 
Further as n — > oo, we have 



BF£>,*]«BF° F (i/) 

by Theorem 4.1 provided — fc < ^ < 1 and {v, k} are independent of n and A4 7 . 

Hence consistency of BF^. F [v, k] follows from consistency of BF ri? (z/). 

Remember that BF 7: i?(V), the Bayes factor w.r.t. sub-harmonic priors, under 
SS errors, is equal to BF 7 . F [^, 0] for < v < 1. Hence consistency of BF 7:F (z/) 
follows from consistency of BF 7 . F [t/, k], □ 

Remark 5.1. Liang et al. (2008) established model selection consistency for 
v < and k = 2 — v for Gaussian errors. Corollary 5.1 in conjunction with 
Theorem 4.1 extends their result to the entire class of SS distributions for a 
broader class of v and k. Also as noted in Remark 3.3, BF 7:F (^) results in 
model selection consistency for all models including the null model for all SS 
distributions. Additionally a development along the lines of Theorem 4.1 allows 
an analogous BIC based approximation to BF rF (^). 

This extension also allows an extension of model selection consistency to all 
SS error distributions for the method of Guo and Spcckman (2009); Celeux et al. 
(2011) based on R 2 . Further as noted above, an alternative robust sub-harmonic 
prior based method, the model selection consistency for these R 2 based methods 
also apply for the null model. 

It should be emphasized in each of the above cases that is is the Bayes factor 
method developed for the Gaussian case that is shown to have model selection 
consistency for the entire class of SS error distributions. These Gaussian based 
Bayes factors, however, are not Bayes factors for error distributions which are 
not Gaussian, the sole exception being our robust Bayes factors which are based 
on separable priors in the sense described earlier, and which are simultaneously 
Bayes factors relative to the same prior. 

Remark 5.2. The issue of model selection consistency in our setup, is somewhat 
complicated by the wide choice of possible error distributions. If all errors are 
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normally distributed, then under our assumptions A2 on the design matrix 
Xp, imply that each approaches a constant, and that ||e|| 2 /n — > 1. If on the 
other hand, all models are variance mixtures of Gaussians with mixture variance 
distributed as a positive random variable £, then ||e|| 2 /n — > £ a random variable, 
and i? 2 also approaches a random variable which is bounded above and below 
in probability provided that £ is similarly bounded. 

In general philosophical terms, it might be better to assume that the se- 
quence of error terms e — (ex, ... , e n )' are exchangeable for all n. By De finetti's 
Theorem, this would imply that the error terms all have a variance mixture 
of normal distributions. We have chosen a slightly weaker requirement on the 
sequence of error distributions, namely, that ||e|| 2 /n remains bounded above 
and below in probability, which extracts the necessary limiting behavior of the 
error terms to ensure consistency of model selection. Interestingly, although we 
attain model selection consistency with these assumptions, it is not necessarily 
true that 1 = vare^ = var£ is consistently estimated by ||e|| 2 /n. 



6. Examples 

In this section, we provide illustrations of the method using both simulated 
and real data. In each example, we compare several different versions of the 

, Q 

Laplace approximated Bayes factors BF 7 . F (V) and BF^. F [BIC]. The values of v 
are —2, —1,0,0.5,0.95. These choices correspond to our default choice, v = 1/2 
and v = 0.95 which also satisfies our robustness condition < v < 1. The choice 
v = approximates BF rF [0, 2] of Guo and Speckman (2009) and v — —2 and 
v = — 1 approximates two choices of Liang et al. (2008) as presented in Corollary 
4.1. 

6.1. Simulation Studies 

We compare numerical performance of our with BIC in a small simulation study. 
We generated 16 possible correlated predictors (p = 16) as follows: 

cor— 0.5 cor— 0.3 cor— 0.1 

x^x 2 , x 3 ,x 4 ,x 5 ,x 6 , x 7l x$ ,x^xTo~ N(0, 1) 

cor=— 0.4 cor= — 0.2 

Xu, £12, Xi 3 , X14, £15, £16 - AT(0, 1). 

Here "cor" denotes the correlation of two Gaussian random variables. Also 
(xi,x 2 ), (x 3 ,X4,), (x 5 ,x 6 ), (x 7 ,x s ), (x 9 ,xio), ^u, %12, Xi3, ^14, %15, 2?i6 are as- 
sumed to be independent. After generating pseudo random x±,..., xie, we cen- 
tered and scaled them as noted in Section 1. We set n = 30 and consider 4 cases 
where the true predictors are 

q T = 16 Xl, X2, x 3 , Xi, x 5 , x 6 , x 7 , x$, x 9 ,x w , Xu,X 12 ,Xi 3 , x u , Xi 5 , x 16 
q T = 12 xi, x 2 , x 3 , Xi, x 5 ,x 6 , x 7 , x s , x 9 ,x w , Xu, x 12 
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<7T = » %l,Xl, X 5 ,X 6 , X 9 ,X 10 ,Xn,Xi2 

q T = 4 Xi,x 2 , x 5 ,x 6 

(where qt denotes the number of true predictors) and the true model is given 
by 

y = l 30 + 2 ^ Xi + a X I y (6.1) 

ie{ tue} \Multi-t(0,J 3 o;3,30), 

with a — 0.5, 1, 2. Tables 1 and 2 show how often the true model ranks first 
and how often it is in the top 3 among 2 16 — 1 candidates when the number 
of replicates is N = 200. The error distributions are Gaussian (Table 1) and 
multivariate-^ with 3 degrees of freedom (Table 2). For the case of normally 
distributed errors (Table 1), the Bayes factor methods performed well and stably 
for a — 0.5 and er = 1 and did reasonably well for a = 2 for the smaller true 
models (qx = 4, 8). BIC seemed, generally, to have a preference for larger models, 
and performed much less well than the Bayes factor method for a — 0.5 and 
a = 1 for models of smaller size (qr = 4, 8, 12) For a = 2, BIC did substantially 
better than BF for the largest model (qT = 16) and somewhat better for q^ = 12. 

, Q 

Performance of BF i: p(v) seemed relatively insensitive to the choice of v. When 
qT 7^ 16, the choice of v makes little difference. But when qT = 16, positive 
v = (0.5, 0.95) seems to perform better especially for larger a. 

Interestingly, for the case of a multivariate-^ error distribution with 3 degrees 
of freedom (the minimum so that a variance exists), the numerical results were 

, Q 

quite similar to those in the normal case for both BF rF (^) and BIC, both 
quantitatively and qualitatively. One possible aspect of the relative insensitivity 
of the results to choice of v in heavy tailed case is the extension of model selection 
consistency for the entire class of SS errors to a broad class mixture of g-prior 
based methods given by Corollary 5.1. 



6.2. Analysis of real data 



In this section, we apply our methods (approximate Bayes factor and BIC) to 
Hald data set presented and analyzed in Casella and Moreno (2006) and to the 
US Crime data set in Raftery, Madigan and Hoeting (1997). See those papers 

for detailed descriptions of the data sets. Table 3 and 4 present posterior proba- 

— G 

bilities based on BF (v) of the top three selected models (assuming equal prior 
probabilities on all models) for several different choices of v (0.95, 0.5, 0, —1, —2). 

BIC was also included in the study. In each case, the first, second and third 

~G 

ranked choices based on BF (y) were identical regardless of the choices of v. 

Q 

Also in each case the top ranked submodel based on BF (y) was regarded as 
reasonable in the earlier papers. As in the simulation study, and as noted in 
several previous studies, BIC seems to choose bigger models. In particular for 
the Hald data, the top choice {x\,X2] agrees with that of Casella and Moreno 
(2006) and also of Berger and Pericchi (1996) and Draper and Smith (1998). 
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Table 1 

Frequency of the true model ( Gaussian error) 



q T 


1 16 


12 




8 


4 


rank 


CO 


CO 


1 


1-3 | 


1 1-3 



a = 0.5 



BF G (0.95) 


1.00 


1.00 


0.94 


1.00 


0.94 


0.99 


0.89 


0.99 


BF G (0.5) 


1.00 


1.00 


0.95 


1.00 


0.94 


0.99 


0.88 


0.99 


BF G (0) 


1.00 


1.00 


0.95 


1.00 


0.94 


0.99 


0.88 


0.99 


BF G (-1) 


1.00 


1.00 


0.96 


1.00 


0.94 


1.00 


0.89 


0.99 


BF G (-2) 


1.00 


1.00 


0.96 


1.00 


0.94 


1.00 


0.89 


0.99 


BIC 


1.00 


1.00 


0.44 


0.62 


0.28 


0.46 


0.20 


0.35 



a = 1 



BF G (0.95) 


0.85 


0.92 


0.87 


0.99 


0.86 


0.97 


0.76 


0.95 


BF G (0.5) 
BF G (0) 


0.83 


0.90 


0.88 


0.99 


0.87 


0.97 


0.76 


0.94 


0.80 


0.87 


0.89 


1.00 


0.87 


0.97 


0.75 


0.94 


BF G (-1) 


0.73 


0.81 


0.89 


1.00 


0.88 


0.97 


0.74 


0.94 


BF G (-2) 


0.55 


0.72 


0.90 


1.00 


0.89 


0.98 


0.76 


0.95 


BIC 


1.00 


1.00 


0.44 


0.62 


0.28 


0.46 


0.20 


0.35 



a = 2 



BF G (0.95) 


0.06 


0.11 


0.26 


0.42 


0.50 


0.74 


0.51 


0.73 


BF G (0.5) 


0.05 


0.10 


0.25 


0.41 


0.51 


0.74 


0.50 


0.72 


BF G (0) 


0.05 


0.10 


0.24 


0.41 


0.52 


0.73 


0.49 


0.72 


BF G M) 


0.04 


0.06 


0.22 


0.39 


0.43 


0.74 


0.48 


0.72 


BF G (-2) 


0.02 


0.03 


0.17 


0.32 


0.47 


0.72 


0.43 


0.71 


BIC 


0.62 


0.77 


0.31 


0.48 


0.24 


0.40 


0.19 


0.34 
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Table 2 

Frequency of the true model (multi-t error) 



q T 


1 16 


12 




8 


4 


rank 


CO 


CO 


1 


1-3 | 


1 1-3 



a = 0.5 



BF G (0.95) 


0.94 


0.95 


0.92 


0.98 


0.85 


0.97 


0.84 


0.96 


BF G (0.5) 


0.94 


0.95 


0.92 


0.98 


0.86 


0.97 


0.84 


0.96 


BF G (0) 


0.93 


0.95 


0.93 


0.98 


0.86 


0.97 


0.84 


0.96 


BF G (-1) 


0.93 


0.94 


0.93 


0.98 


0.87 


0.97 


0.83 


0.96 


BF G (-2) 


0.91 


0.92 


0.95 


0.98 


0.89 


0.98 


0.84 


0.96 


BIC 


0.98 


0.99 


0.47 


0.63 


0.28 


0.45 


0.26 


0.38 



a = 1 



BF G (0.95) 


0.67 


0.70 


0.75 


0.84 


0.72 


0.88 


0.70 


0.85 


BF G (0.5) 
BF G (0) 


0.64 


0.68 


0.76 


0.84 


0.71 


0.88 


0.70 


0.85 


0.62 


0.67 


0.77 


0.84 


0.72 


0.89 


0.69 


0.84 


BF G (-1) 


0.58 


0.63 


0.76 


0.83 


0.75 


0.89 


0.69 


0.84 


BF G (-2) 


0.49 


0.58 


0.76 


0.83 


0.75 


0.89 


0.69 


0.84 


BIC 


0.89 


0.93 


0.45 


0.60 


0.27 


0.44 


0.25 


0.37 



a = 2 



BF G (0.95) 


0.14 


0.20 


0.28 


0.37 


0.37 


0.51 


0.45 


0.58 


BF G (0.5) 


0.14 


0.18 


0.29 


0.36 


0.37 


0.51 


0.45 


0.57 


BF G (0) 


0.13 


0.17 


0.29 


0.36 


0.38 


0.51 


0.45 


0.57 


BF G M) 


0.09 


0.13 


0.29 


0.33 


0.38 


0.51 


0.44 


0.56 


BF G (-2) 


0.07 


0.11 


0.24 


0.34 


0.35 


0.49 


0.40 


0.55 


BIC 


0.47 


0.59 


0.28 


0.39 


0.21 


0.32 


0.20 


0.30 



Y. Maruyama and W. Straw derman/B ayes factors 



23 



Table 3 

Hald data: posterior probabilities of top 3 selected models 

BF^ 

v 0.95 0.5 -1 -2 

1 {1,2, } 0.66 0.63 0.61 0.57 0.54 

2 {1, 4} 0.16 0.17 0.17 0.18 0.20 

3 {1,2, 4} 0.06 0.07 0.07 0.08 0.08 

BIC 

1 {1,2, } 0.25 

2 {1,2, 4} 0.23 

3 {1,2,3, } 0.23 



Table 4 

US crime data: posterior probabilities of top 3 selected models 



v 0.95 0.5 -1 -2 

1 {1,3,4, 9,11, 13,14 } 0.020 0.019 0.018 0.016 0.015 

2 {1,3,4, 9,11, 13,14,15} 0.018 0.018 0.017 0.015 0.014 

3 {1,3, 5,9,11, 13,14 } 0.013 0.013 0.012 0.011 0.010 

BIC 

1 {1,3,4, 9,11, 13,14,15} 0.035 

2 {1,3,4, 9,11, 13,14 } 0.026 

3 {1,3,4, 9,11,12,13,14,15} 0.019 



For the US Crime data, our top ranked model agrees with that of the Oc- 
cam's window posterior in Table 2 of Raftery, Madigan and Hoeting (1997). 
Interestingly our second ranked model includes x\§ which does not occur in any 
of Raftery, Madigan and Hocting's (1997) Occam's window model choices, but 
which does occur in several models chosen by such classical methods as Mallow's 
C p , adjusted R 2 , etc. in their Table 1. 

7. Concluding remarks 

Bayesian model selection for linear regression models with Gaussian errors has 
been popular area of study for some time. There is also a substantial litera- 
ture devoted to studying the extension of Stein-type shrinkage estimators from 
models with Gaussian errors to those with general SS errors. In particular, it 
has long been observed that certain shrinkage estimators which improve over 
the least squares (LS) estimator for Gaussian models also improve over the LS 
estimator simultaneously for all SS error models (See for example, Collier, Four- 
drinier and Robert (1989)). Maruyama (2003) and Maruyama and Strawderman 
(2005) found, in addition, that certain separable priors (in the sense described 
in Section 2) leads to generalized Bayes shrinkage estimators that do not de- 
pend on the form of the underlying SS distribution and that also simultaneously 
improve on the LS estimator, sometimes dramatically so. The original aim of 
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this research was to see if similar separable priors could be found that have 
this distributional robustness property in the variable selection problem (and, 
of course, also to perform well). 

The generalized Bayes priors developed in sections 2 and 3 turned out to 
satisfy our requirements and also to be closely related to other so called g- 
priors (or mixtures of g-priors) in the literature (See e.g. Liang ct al. (2008); 
Guo and Spcckman (2009)). Our original development required that the null 
model be excluded from the class of possible models, however we observed (see 
Remark 3.3) that a slightly altered version which treats the intercept term 
the same as all other regression coefficients (as opposed to giving it a uniform 
prior), allows the null model to be included as well. This alternative development 
gives Bayes factor which depend on the R 2 relative to the null model where all 
coefficients including a is 0. This dependence on R 2 is related to the model 
selection procedures of Guo and Speckman (2009); Celeux et al. (2011). 

Our prior distribution on the regression parameters is sub-harmonic for each 
(non-null) model, and as such, leads to admissible estimators under quadratic 
loss in each of the non-null models when the variance is known (see Maruyama 
and Takemura (2008)) regardless of the SS error distribution. Also if q > 3 and 
v = 2 the generalized Bayes estimator for each submodel is minimax and dom- 
inates the James-Stein estimator (See Maruyama (2003)). Hence although we 
require < v < 1, we nevertheless expect good performance of the correspond- 
ing Bayes estimators. 

The expression of our Bayes factors, e.g. (3.6), are relatively simple involving 
the ratio of two 1-dimensional integrals. To further simplify calculations we 
investigated Laplace approximations to our Bayes factors, and more generally, to 
a collection of Bayes factors arising from mixtures of g-priors that have recently 
appealed (See Liang et al. (2008); Guo and Speckman (2009)). We show in 
Section 4 that in each case the Bayes factor can be approximated as the Bayes 
factor for the Gaussian model based on BIC times a simple rational function 
depending q, v and the R 2 of the models. 

Using these Laplace approximations we are able to establish model selection 
consistency of our robust procedure for the entire class of SS distributions and 
to extend the model consistency results of several earlier papers for the Gaussian 
case to the entire class of SS distributions. 

A small simulation study and an analysis of the Hald data (See Casella and 
Moreno (2006)) and the US Crime data (See Raftery, Madigan and Hoeting 
(1997)) indicates that our method performs well. It gives results consistent with 
the results of the cited papers for the real data sets and performs comparably 
and sometimes better than several of the mixture of g-prior methods. 
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Appendix A: Proof of Lemma 3.1 

Under the submodel A4 7 , the conditional marginal density of y with respect to 
improper prior ?y^ 2 1 given a and /3 7 is 

[>00 

/ < /2 / (v-y\\V ~ «ln ~ X 7 /3 7 || 2 ) rf^-H^ 
Jo 

poo 

= \\y - al n - X 7 /3 7 ||-"- y / t^+^^f^dt 

Jo 

= ^ / fa (v\\v - aln X 7 /3 7 || 2 ) ^"V^dr, 

Jo 



f°° t ^+»y*-if G (t)dtj 



£[IMI1 



i/l roc 

»] Jo ^ /2fG ~ ^ ~ XM2 "> rt 2 ' 1 ^ 



E[\\e G \ 

where f G {t) = (2ir)- n / 2 exp(-t/2), provided 

^ t^+^^f^dt < oo & £[||e 7 f] < oo. (A.2) 
Jo 



Therefore, we have 

E[hc\\"] 



M M») = P ni' 7 !,i ^ G faH- (A.3) 



Appendix B: Proof of Theorem 4.1 

Denote the left-hand side of (4.G) by H(n). When approximating H(n), make 
the change of variables r = log g. See Liang ct al. (2008) for details. With this 
transformation, the integral becomes 

poo (y/2-l) T (i , T\{n-q-l)/2 

H{n)= / — ^L±l ^^ e r dT (B.l) 

J-oo (l + e -) fe / 2 {l + r e -} ( "^ 1)/2 

where the extra e T comes from the Jacobian of the transformation of variables. 
Denote the logarithm of the integrand function in (B.l) by h{r,n). We have 

9 w \ 1 z f( n _i)(i_r) v + k . A rnn . 
or 21 + z [ 1 + rz z J 

where z = e T . Since < r < 1 and v + k > 0, the equation {d/dr}h(T, n) = 
has the only one positive root z = e T . It clearly satisfies 

lim * = (B.3) 

n->oo n q — v 
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Hence we have 

e h(f.n) = + + $-l)-k(l + rz)- n+1 } 1/2 

/ n/{rz )^ 1 



n 

n{ ^T3I} ) r CX P tt 1 - Vr}n/I) 



1/2 

1~ V ) * ' r -n+l I 

n{l/r-l}e) j 
Similarly, as in (B.2), we have 

d 2 (d/dr)h(T,n) _ z 2 f(n-l)(l-r)r i/ + k 

W 1 (r ' n) " 1 + z ~ 2(1 + z)\ (1 + rz) 2 + z 2 

and 

d , , . , 5 (rj — — r)rz n 1- r q — v ,„ . 

^ T '"^ M - 2(l + i) (l + ri)» W -2i— (R5) 

Therefore we have 

f 4W ff -„ -1 1/2 (B.6) 



1 gr — i/ \n{l/r — l}e 

as n — > oo. Hence the part 1 of the theorem follows. 

Since BF^. F [v,k} in (4.2) is given by the ratio of such integrals, part 2 of the 
theorem follows. 



Appendix C: Proof of Lemma 5.1 

Let Mt be the true submodel y = axln + XxfiT + cFT^ where Xt is the nxqx 
true design matrix and /3t is the true (qr x 1) coefficient vector. 

For the submodel A4 7 , 1 — R 2 is given by ||Q 7 (y — yl„)|| 2 /||y — yln|| 2 with 
Q 1 = I — X~ l (X! y X~ f )~ 1 X! y . The numerator and denominator are rewritten as 

||Q 7 (y - yl„)|| 2 = \\Q- ( X T (3 T + a T Q 7 e\\ 2 

= [H'rpX'rpQ^XTftT + 2(7x/3y X^pQ-yE + cr^e'Q^e 

where e = e — el„ and similarly 

|| V - yl„|| 2 - /3^X^X t /3t + 2a T ff T X' T e + a 2 ||e|| 2 . 
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Since e'Q 7 e < ||e|| 2 , 1 — i? 2 is bounded as 

f3' T {X' T Q 1 X T /n}f3 T + 2a T f3' T {X' T Q 1 eln} + a^W 7 V n 



(3' T {X' T X T /n}(3 T + 2a T f3' T {X! r e/n} + cr^W 7 F r 
^ 1 D 2 ^ P' T {X' T QiX T /n}PT + 2a T (3' T {X' T Q 1 e/n} + <j 2 T V, 



/3^{X^X T /n}/3 T + 2/3^{X^e/n} + erf, 14 

where V n = e'e/n and W 7 = e'Q 7 e/||e|| 2 - Se({n - g 7 - l}/2,q 7 /2). In (C.2), 
we have the following. 

• Since E[e] = and Var[e] = I n , E[X' T e/n] = and 

var (X' T e/n) = n^iX^Xr/n} -> 0. (C.3) 

Therefore (3' T X' T e/n approaches in probability. 

• When 7 D T, Q 7 X T is a zero matrix. When 7 ^ T, /3^{X^Q 7 e/n} 
in probability can be proved as (C.3). 

• By the assumption A2, X' t Xt/tl — X^Q^Xx/n is positive-definite for 
any n and hence 

(3' T {X' T X T /n}p T > /3' T {X^Q y X T /n}/3 T , for (3 T ± 0. 

• Wj converges to 1 in probability. 

• By the assumption Al on e'e/n, V n is also bounded in probability from 
below and from above. 

Combining these facts, we see < i? 2 < 1 with strict inequalities in probability. 

Since Q y X T = for 7 D T and using (C.l), (1 - i?^)/(l - i? 2 ) is given by 
1 1 e 1 1 2 / 1 1 Q^,e 1 1 2 . Further we easily have 

l-R 2 T = IIQrell 2 ||e|| 2 = 1 
" 1-i? 2 ||Q 7 e|| 2 " ||Q 7 e|| 2 W 1 ' 

Note W 1 ~ Be({n - q y - l}/2,g 7 /2) is distributed as (1 + xlJxl-^-iT 1 
where Xn—q—i an< ^ Xq are independent. Hence 

{l+xlJxi-^-x} n = { l + { n lxl- qi -i}{x 2 q Jn)) " 
~ exp(— Xq y ) as n — > 00 

since Xn-^-i/" — > 1 in probability. Therefore W^T" is bounded in probability 
from above and hence the theorem follows. 



Y. Maruyama and W. Straw derman/B ayes factors 



28 



(1 - R^)/(l - Rl) is written as 



1 - i?2 [3' T X' T Q 1 X T (3 T + 2a T (3' T X' T Q 1 e + a^e'Q^e 

< (C4) 

- f3' T X' T Q 1 X T f3T + 2o- T f3' T X' T Q 1 e + a^e'Q^e y ' ' 

P' T {X^X T /n}l3T + 2a T /3' T {X' T Q 1 e/n} 

7 

Clearly W 1 -> 1 in probability. Also since 7 ^ T, f3' T {X^Q 7 X T /n}f3T > for 
any n. Further as {X^Q 7 e/n} — >■ in probability, (1 — i?^)/(l — i? 7 ) is strictly 
smaller than 1 in probability. 
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